Add automatic reference processing and markuplib support for DOCX #60
eduranm wants to merge 8 commits into scieloorg:main from
Conversation
Pull request overview
This PR adds the groundwork for structurally analyzing DOCX files (via markuplib) and integrates automatic reference processing into the markup_doc flow, triggering it when a document is created/uploaded.
Changes:
- Registers the markup_doc and markuplib apps and adds base utilities for structural analysis of DOCX files.
- Adds Celery tasks to process the uploaded DOCX, detect references, and persist the processed content in the document.
- Adds Wagtail hooks/admin for the upload flow and for syncing collections/journals from the API.
Reviewed changes
Copilot reviewed 16 out of 23 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| model_ai/llama.py | Adjusts the Gemini flow in LlamaService (includes a fixed pause after each response). |
| markuplib/function_docx.py | New utilities for opening and extracting content/structure from DOCX files. |
| markuplib/__init__.py | Initialization of the markuplib package. |
| markup_doc/wagtail_hooks.py | Wagtail ViewSets and hooks for upload/editing and triggering automatic processing. |
| markup_doc/tests.py | Test file (placeholder). |
| markup_doc/tasks.py | Celery task to process the DOCX and structure content + references. |
| markup_doc/sync_api.py | Synchronization of collections and journals from the SciELO Core API. |
| markup_doc/models.py | Models and StreamFields to persist front/body/back and metadata. |
| markup_doc/migrations/__init__.py | Initialization of the migrations module. |
| markup_doc/migrations/0001_initial.py | Initial migration for the markup_doc models. |
| markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py | Field/choice adjustments for estatus. |
| markup_doc/marker.py | Utilities for LLM-based markup (article/references). |
| markup_doc/labeling_utils.py | Utilities for segmentation, APA citation extraction, and mapping/labeling. |
| markup_doc/forms.py | Form base (placeholder). |
| markup_doc/choices.py | Choices/base label structure and ordering rules. |
| markup_doc/apps.py | AppConfig for markup_doc. |
| markup_doc/admin.py | Django admin (placeholder). |
| markup_doc/__init__.py | Initialization of the markup_doc package. |
| fixtures/e14790.docx | Sample DOCX for manual testing. |
| config/settings/base.py | Registers markup_doc and markuplib in INSTALLED_APPS. |
```python
if model.name_file:
    user = User.objects.get(pk=user_id)
    refresh = RefreshToken.for_user(user)
    access_token = refresh.access_token

    #url = "http://172.17.0.1:8400/api/v1/mix_citation/reference/"
    #url = "http://172.17.0.1:8009/api/v1/mix_citation/reference/"

    # FIXME: Hardcoded URL
    url = "http://django:8000/api/v1/reference/"

headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}

response = requests.post(url, json=payload, headers=headers)
```
In process_reference(), access_token and url are only set inside if model.name_file:, but headers and requests.post() run unconditionally. If name_file is blank (e.g., using a remote API), this will raise UnboundLocalError. Initialize url/access_token for both branches or return/raise when the required config is missing.
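One way to apply the fix the review asks for is to resolve the URL and headers in a single helper that either succeeds for both branches or fails fast, so the later `requests.post()` can never see undefined names. The helper below is a hypothetical sketch (its name and the `access_token_factory` parameter are not in the PR); it keeps the hardcoded URL the FIXME already flags.

```python
def build_request_config(name_file, access_token_factory):
    """Return (url, headers) or raise when required config is missing.

    Hypothetical helper: both branches either produce url/headers or
    fail fast, avoiding the UnboundLocalError the review describes.
    """
    if not name_file:
        raise ValueError("name_file is required to resolve the reference API URL")
    access_token = access_token_factory()
    url = "http://django:8000/api/v1/reference/"  # still hardcoded, as in the PR
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return url, headers
```

`process_reference()` would then call this once and post with the returned values, instead of building `headers` from a possibly unset `access_token`.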
```python
def match_section(item, sections):
    return {'label': '<sec>', 'body': True} if (
        item.get('font_size') == sections[0].get('size') and
        item.get('bold') == sections[0].get('bold') and
        item.get('text', '').isupper() == sections[0].get('isupper')
    ) else None


def match_subsection(item, sections):
    return {'label': '<sub-sec>', 'body': True} if (
        item.get('font_size') == sections[1].get('size') and
        item.get('bold') == sections[1].get('bold') and
        item.get('text', '').isupper() == sections[1].get('isupper')
    ) else None
```
match_section()/match_subsection() index sections[0] and sections[1] without checking length. If sections has fewer than 2 entries (common for short/simple documents), this will raise IndexError. Add guards (e.g., if len(sections) > 0/1) before indexing.
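A guarded version could look like the sketch below (the `_safe` name is only for illustration): check the list length before indexing and return `None` when the expected section style is absent, which matches how the callers already treat a non-match.

```python
def match_section_safe(item, sections):
    # Guard for short/simple documents where no section style was
    # detected; without this, sections[0] raises IndexError.
    if not sections:
        return None
    ref = sections[0]
    if (item.get('font_size') == ref.get('size')
            and item.get('bold') == ref.get('bold')
            and item.get('text', '').isupper() == ref.get('isupper')):
        return {'label': '<sec>', 'body': True}
    return None
```

`match_subsection` would get the same treatment with `if len(sections) < 2: return None` before reading `sections[1]`.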
```python
if not result:
    result = {'label': '<p>', 'body': state['body'], 'back': state['back']}
    state['label'] = result.get('label')
    state['body'] = result.get('body')
    state['back'] = result.get('back')

if result:
    pass
else:
    if state.get('label_next'):
        if state.get('repeat'):
            result = match_by_regex(item.get('text'), order_labels)
            if result:
                state['label'] = result[0]
            else:
                result = match_by_style_and_size(item, order_labels, style='bold')
                if result:
                    state['label'] = result[0]
                    state['repeat'] = None
                    state['reset'] = None
                    state['label_next'] = result[1].get("next")
                    state['body'] = result[1].get("size") == 16
                    if state['body'] and re.search(r"^(refer)", item.get('text').lower()):
                        state['body'] = False
                        state['back'] = True
        if not result:
            result = match_next_label(item, state['label_next'], order_labels)
            if result:
                state['label'] = result[0]
                state['label_next_reset'] = result[1].get("next")
                state['reset'] = result[1].get("reset", False)
                state['repeat'] = result[1].get("repeat", False)
            else:
                result = match_by_style_and_size(item, order_labels, style='bold')
                if result:
                    state['label'] = result[0]
                    state['label_next'] = result[1].get("next")
                    if state.get('body') and re.search(r"^(refer)", item.get('text').lower()):
                        state['body'] = False
                        state['back'] = True
                else:
                    result = match_by_style_and_size(item, order_labels, style='italic')
                    if result:
                        state['label'] = re.sub(r"-\d+", "", result[0])
                        state['label_next'] = result[1].get("next")
    else:
        result = match_by_regex(item.get('text'), order_labels)
        if result:
            state['label'] = result[0]
        else:
            result = match_paragraph(item, order_labels)
            if result:
                state['label'] = result[0]
```
In create_labeled_object2(), result is forced to a non-empty dict at line 700 and then the else: branch (which contains most of the labeling logic) becomes unreachable because of if result: pass. This makes the function effectively label everything as <p> unless it matches the section/subsection checks. Rework the control flow so the detailed matching logic can run when appropriate.
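The fix amounts to making `<p>` the fallback that runs last rather than a default assigned first. A minimal sketch of that pattern (a hypothetical simplification of `create_labeled_object2()`, not the PR's actual code):

```python
def label_item(item, matchers, default_label='<p>'):
    """Try each matcher in order; only fall back to the default when
    none matched. This keeps the detailed matching logic reachable,
    unlike pre-filling result and then skipping the else branch."""
    for matcher in matchers:
        result = matcher(item)
        if result:
            return result
    return {'label': default_label}
```

In the real function the matcher chain would be the existing `match_by_regex` / `match_by_style_and_size` / `match_next_label` cascade, with the state updates kept inside each branch.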
```python
obj['type'] = 'aff_paragraph'

if re.search(r"^(translation)", item.get('text').lower()):
    state['label'] = '<translate-fron>'
```
state['label'] = '<translate-fron>' looks like a typo (missing 't') and will produce a label that doesn't match the choices (<translate-front>). Use the correct label string so downstream logic can recognize it.
Suggested change:
```diff
-    state['label'] = '<translate-fron>'
+    state['label'] = '<translate-front>'
```
```python
response_gemini = model.generate_content(user_input).text
time.sleep(15)
return response_gemini
```
time.sleep(15) after every Gemini call will throttle all reference processing and can tie up Celery workers even when the request succeeds. Consider removing the unconditional sleep and instead implement retry/backoff only when Gemini returns rate-limit/transient errors (e.g., 429/503), ideally with jitter.
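A retry-with-backoff wrapper along these lines would replace the unconditional sleep. This is a hedged sketch: `call` stands in for `model.generate_content(...)` and `is_transient` for whatever predicate classifies 429/503-style errors in the real client.

```python
import random
import time

def call_with_backoff(call, is_transient, max_retries=4, base_delay=1.0):
    """Retry only on transient/rate-limit errors, with exponential
    backoff plus jitter, instead of sleeping after every call."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Successful calls return immediately, so Celery workers are only delayed when Gemini actually pushes back; Celery's own `retry_backoff`/`retry_jitter` task options are another way to get the same behavior.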
```python
def update(cls, title, estatus):
    try:
        obj = cls.get(title=title)
    except (cls.DoesNotExist, ValueError):
        pass

    obj.estatus = estatus
    obj.save()
    return obj
```
In update(), if get() raises DoesNotExist, the exception is swallowed and obj is left undefined, but the code still tries to set obj.estatus. Either re-raise/return early when not found, or create the object as appropriate.
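The fail-fast shape of the fix can be sketched without Django, using a `lookup` callable that stands in for `cls.get()` and a `NotFound` exception standing in for `DoesNotExist` (all names here are hypothetical):

```python
class NotFound(Exception):
    pass

def update_status(lookup, title, estatus):
    """Fail fast instead of swallowing the not-found case and then
    touching an undefined obj."""
    try:
        obj = lookup(title)
    except NotFound:
        # Either re-raise with context (as here), or create the object
        # if update() is really meant to be an upsert.
        raise NotFound(f"No document with title {title!r}") from None
    obj['estatus'] = estatus
    return obj
```

In the model itself this would mean re-raising (or returning early) inside the `except (cls.DoesNotExist, ValueError):` block rather than `pass`.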
```python
if is_numPr:
    numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
    numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
    type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]

    # It is a different list
    if numId != current_num_id:
        current_num_id = numId
        if len(current_list) > 0:
            current_list.append('[/list]')
            objl = {}
            objl['type'] = 'list'
            objl['list'] = '\n'.join(current_list)
            current_list = []
            content.append(objl)
        list_type = 'bullet'
        if type[0][1][str(0)] == 'decimal':
            list_type = 'order'
```
extract_numbering_info() can return None when word/numbering.xml is missing, but extractContent() unconditionally does list_types.items() and later indexes type[0]. This will raise at runtime for DOCX files without numbering or with unexpected numId mappings; handle list_types is None and the empty-match case before using it.
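Both guards can be folded into one small lookup, sketched below with a hypothetical helper name; defaulting to `'bullet'` when numbering data is missing is an assumption, chosen because it is the value the PR already starts from.

```python
def resolve_list_type(list_types, num_id):
    """Guarded lookup: tolerate a missing word/numbering.xml
    (list_types is None) and a numId with no match, instead of
    indexing an empty result."""
    if not list_types:
        return 'bullet'
    matches = [obj for obj in list_types.values() if obj.get('numId') == num_id]
    if not matches:
        return 'bullet'
    # The PR stores the level-0 format under the string key '0'.
    return 'order' if matches[0].get('0') == 'decimal' else 'bullet'
```

`extractContent()` would call this instead of building `type` and indexing `type[0][1][str(0)]` directly.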
```python
else:
    obj['spacing'] = False

clean_text = clean_labels(child.text)
```
clean_labels(child.text) will fail when child.text is None (common for <w:r> elements which usually contain <w:t> children). This will raise a TypeError in re.sub. Extract text from the run's <w:t> nodes (or guard against None) before calling clean_labels.
Suggested change:
```diff
-clean_text = clean_labels(child.text)
+run_text_nodes = child.xpath('.//w:t/text()', namespaces=child.nsmap)
+raw_text = ''.join(run_text_nodes) if run_text_nodes else (child.text or '')
+clean_text = clean_labels(raw_text)
```
```python
def get_labels(title, user_id):
    article_docx = UploadDocx.objects.get(title=title)
```
UploadDocx.objects.get(title=title) relies on non-unique titles and can raise MultipleObjectsReturned / select the wrong row. Prefer passing a primary key to the task and fetching by pk.
Suggested change:
```diff
-def get_labels(title, user_id):
-    article_docx = UploadDocx.objects.get(title=title)
+def get_labels(upload_docx_id, user_id):
+    article_docx = UploadDocx.objects.get(pk=upload_docx_id)
```
```python
# FIXME: This function always fetches the first LlamaModel instance.
model_ai = LlamaModel.objects.first()

if model_ai.api_key_gemini:
```
get_llm_model_name() assumes a LlamaModel row always exists; if the table is empty, model_ai will be None and model_ai.api_key_gemini will raise. Guard with if model_ai and model_ai.api_key_gemini: (and decide on a sensible default when it is None).
Suggested change:
```diff
-if model_ai.api_key_gemini:
+if model_ai and model_ai.api_key_gemini:
```
What does this PR do?
Adds the groundwork for automatically processing bibliographic references within markup_doc and incorporates markuplib for structural reading of DOCX files. It includes:
- the markuplib app;
- markuplib/ with utilities for analyzing DOCX files;
- markup_doc for processing and marking up references.
Where could the review start?
By commits
How could this be tested manually?
- Bring up the environment;
- Upload a DOCX through the markup_doc flow;
- Verify that the document moves to the PROCESSING state;
- Once finished, check that the references are added, structured, to the processed document.
Any context you want to give?
It focuses on automatic reference processing and structural reading of the DOCX, leaving the groundwork ready to continue with the front matter, body text, and XML output.
Screenshots
N/A
What are the relevant tickets?
#59
References